Week 8 - Support Vector Machines¶

Dr. David Elliott¶

1.7. Hyperparameters

1.8. Imbalanced Data

1.9. Multi-Class

1.10. Strengths and Limitations

Notes

  • The idea for this workbook/lecture is to focus on demonstrating the model's use on different datasets.

The Iris dataset is useful for demonstrating SVMs, but it is a bit "too small to be representative of real world machine learning tasks"1.

Furthermore, when comparing to other methods, we will probably do well no matter how our hyperparameters are set. So for this lecture/workbook I'll demonstrate its use on larger (yet still manageable) datasets.

Choosing Hyperparameters ¶

Dataset Example: Abalone¶

The Abalone Dataset2

Name            Data Type    Meas.    Description
Sex             nominal      -        M, F, and I (infant)
Length          continuous   mm       Longest shell measurement
Diameter        continuous   mm       perpendicular to length
Height          continuous   mm       with meat in shell
Whole weight    continuous   grams    whole abalone
Shucked weight  continuous   grams    weight of meat
Viscera weight  continuous   grams    gut weight (after bleeding)
Shell weight    continuous   grams    after being dried
Rings           integer      -        +1.5 gives the age in years
Sex Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Rings
0 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15
1 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070 7
2 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210 9
3 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 10
4 I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055 7

There are three classes for SVM classification in Scikit-Learn3:

Class           Time Complexity           Out-of-core Support    Kernel Trick
LinearSVC       O(m × n)                  No                     No
SGDClassifier   O(m × n)                  Yes                    No
SVC             O(m² × n) to O(m³ × n)    No                     Yes

Notes

  • "Standard solvers take $O(N^3)$ time. However, specialized algorithms, which avoid the use of generic QP solvers, have been developed for this problem, such as the sequential minimal optimization or SMO algorithm (Platt 1998). In practice this can take $O(N^2)$. However, even this can be too slow if $N$ is large. In such settings, it is common to use linear SVMs, which take $O(N)$ time to train (Joachims 2006; Bottou et al. 2007)."4

First, let's make a pipeline with two steps:

  1. Standardize the features
  2. LinearSVC / SVC
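The two steps above can be sketched with sklearn's make_pipeline (the variable names below are just illustrative):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Step 1: standardise the features; step 2: the SVM itself
linear_pipe = make_pipeline(StandardScaler(), LinearSVC())
rbf_pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# make_pipeline names each step after its class (lowercased)
print(list(rbf_pipe.named_steps))
```

The auto-generated step names ("standardscaler", "svc") are what you use to address hyperparameters in a search, e.g. svc__C.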

It's always important to standardise your data first, as SVMs are not scale invariant: the maximum-margin hyperplane depends on distances between observations, so features measured on larger scales dominate the margin (and, for the RBF kernel, a single $\gamma$ applies to every dimension).

You shouldn't stick to the default settings; instead, search for optimal hyperparameters for the data. Good suggestions for search spaces are:

  • "The authors of libsvm (Hsu et al. 2009) recommend using cv over a 2d grid with values $C \in \{2^{-5}, 2^{-3}\,\ldots,2^{15}\}$ and $\gamma \in \{2^{-15}, 2^{-13}, \ldots,2^{3}\}$"4.
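As a sketch of that recommendation (using a small synthetic dataset in place of Abalone, so sizes here are hypothetical), the grid can be built from powers of two and searched exhaustively:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in dataset (the lecture uses Abalone)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# libsvm's recommended 2-d grid: C in {2^-5, 2^-3, ..., 2^15},
# gamma in {2^-15, 2^-13, ..., 2^3}
param_grid = {
    "svc__C": 2.0 ** np.arange(-5, 17, 2),      # 11 values
    "svc__gamma": 2.0 ** np.arange(-15, 5, 2),  # 10 values
}

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
search = GridSearchCV(pipe, param_grid, cv=3, n_jobs=-1).fit(X, y)
print(search.best_params_)
```

The 110 combinations here cover the RBF kernel only; searching a second model as well roughly doubles the candidate count.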

Notes

  • took around 13 mins on my laptop to do all 220 candidates

Random search¶

"Check this great blog post at Dato by Alice Zheng, specifically the section Hyperparameter tuning algorithms.

I love movies where the underdog wins, and I love machine learning papers where simple solutions are shown to be surprisingly effective. This is the storyline of “Random search for hyperparameter optimization” by Bergstra and Bengio. [...] Random search wasn’t taken very seriously before. This is because it doesn’t search over all the grid points, so it cannot possibly beat the optimum found by grid search. But then came along Bergstra and Bengio. They showed that, in surprisingly many instances, random search performs about as well as grid search. All in all, trying 60 random points sampled from the grid seems to be good enough." https://stats.stackexchange.com/questions/160479/practical-hyperparameter-optimization-random-vs-grid-search

TODO

  • use the distributions above for my random search params to show this way may be quicker
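A sketch of the random-search alternative, drawing 60 samples from log-uniform distributions spanning the same ranges (again on a hypothetical synthetic dataset):

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Continuous log-uniform distributions covering the libsvm grid ranges
param_distributions = {
    "svc__C": loguniform(2 ** -5, 2 ** 15),
    "svc__gamma": loguniform(2 ** -15, 2 ** 3),
}

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
search = RandomizedSearchCV(pipe, param_distributions, n_iter=60, cv=3,
                            random_state=0, n_jobs=-1).fit(X, y)
print(search.best_params_, search.best_score_)
```

The 60 draws follow the "good enough" heuristic from the quote, and amount to far fewer fits than the full grid.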

Some models require hyperparameter tuning more than others. Here we see that searching improves our RBF scores, although not the linear SVM's by much.

Todo

  • use the gridsearchcv validation set for this, don't bother making three separate sets! Then make this a boxplot.

Imbalanced Data ¶

Dataset Example: Default¶

Default: Customer default records for a credit card company

"We are interested in predicting whether an individual will default on his or her credit card payment, on the basis of annual income and monthly credit card balance."5

default student balance income
1 No No 729.526495 44361.62507
2 No Yes 817.180407 12106.13470
3 No No 1073.549164 31767.13895
4 No No 529.250605 35704.49394
5 No No 785.655883 38463.49588

"We have plotted annual income and monthly credit card balance for a subset of 10,000 individuals"5

"It appears that individuals who defaulted tended to have higher credit card balances than those who did not."5

Below is a recreation of Figure 4.1 from An Introduction to Statistical Learning5.

Notes

  • "It is worth noting that Figure 4.1 displays a very pronounced relationship between the predictor balance and the response default. In most real applications, the relationship between the predictor and the response will not be nearly so strong. However, for the sake of illustrating the classification procedures discussed in this chapter, we use an example in which the relationship between the predictor and the response is somewhat exaggerated."5

The learning and prediction of machine learning algorithms tend to be affected by class imbalance6.

Class Distribution (%)
No     96.67
Yes     3.33
Name: default, dtype: float64

Notes

  • data is shuffled and stratified in the train_test_split

If we do a random search, we can find a model with high accuracy.

Best Linear Model Accuracy: 96.56%

However, this is not much better than a completely useless model that only predicts "No".

Notes

  • "...since only 3.33% of the individuals in the training sample defaulted, a simple but useless classifier that always predicts that each individual will not default, regardless of his or her credit card balance and student status, will result in an error rate of 3.33%. In other words, the trivial null classifier will achieve an error rate that is only a bit higher than the LDA training set error rate."

TODO

  • change to sample rates in the validation data

Useless Model Accuracy: 96.11%

"In practice, a binary classifier such as this one can make two types of errors: it can incorrectly assign an individual who defaults to the no default category, or it can incorrectly assign an individual who does not default to the default category. It is often of interest to determine which of these two types of errors are being made. A confusion matrix, shown for the Default data in Table 4.4, is a convenient way to display this information."5

"So while the overall error rate is low, the error rate among individuals who defaulted is very high. From the perspective of a credit card company that is trying to identify high-risk individuals, an error rate of 80% among individuals who default may well be unacceptable."5

Using this confusion matrix we can manually work out a number of different performance metrics.

Error and Accuracy

Give general performance information: the proportion of correct (accuracy) or incorrect (error) predictions relative to the total number of predictions, across both the positive and negative labels.

Recall (or True Positive Rate)

Calculates how many of the actual positives our model correctly or incorrectly labelled. This is useful when the fraction of correctly classified or misclassified samples in the positive class is of interest.

Precision (PRE)

Precision gives information on how precise your model is by looking at how many positive predicted labels are actually positive. To calculate it, divide the number of true positives by the total number of predicted positives (true positives plus false positives). Precision is a good measure when the cost of a false positive is high.

F1-score

The F1-score combines recall and precision, and is used when there is an uneven class distribution due to a large number of actual negatives that you are not as focused on.

Different metrics may be more appropriate depending on the application.
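As a small worked sketch (the confusion-matrix counts below are made up for illustration, not the Default results), each metric falls straight out of the four cells:

```python
# Hypothetical confusion-matrix cells: true/false negatives and positives
tn, fp, fn, tp = 9600, 30, 250, 120
total = tn + fp + fn + tp

accuracy = (tp + tn) / total                        # all correct / all predictions
error = (fp + fn) / total                           # all wrong / all predictions
recall = tp / (tp + fn)                             # true positive rate
precision = tp / (tp + fp)                          # predicted positives that are right
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"acc={accuracy:.3f} err={error:.3f} rec={recall:.3f} "
      f"pre={precision:.3f} f1={f1:.3f}")
```

Note how, as with Default, accuracy can be high while recall on the rare positive class is poor.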


  1. Raschka, Sebastian, and Vahid Mirjalili. Python Machine Learning, 2nd Ed. Packt Publishing, 2017.
  2. https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9

TODO

  • update with algebra/description from thesis

Furthermore, we can use a classification report, which gives more information such as the micro avg, macro avg, and weighted avg (though we can also obtain these from the previous metric functions using the average parameter).

Macro Average

  • Treats all classes equally as each metric is calculated independently and the average is taken.

Micro Average

  • Aggregates the contributions of all classes and computes an average metric.
  • This is preferable to macro avg in cases with class imbalances.

Weighted Average

  • Each contribution to the average is weighted by the relative number of examples in a class available.

https://datascience.stackexchange.com/questions/15989/micro-average-vs-macro-average-performance-in-a-multiclass-classification-settin
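A quick sketch of the difference on made-up imbalanced labels, using sklearn's average parameter:

```python
from sklearn.metrics import recall_score

# Imbalanced toy labels: 8 negatives, 2 positives; one positive is missed
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

# Macro: per-class recalls (1.0 and 0.5) averaged with equal weight
macro = recall_score(y_true, y_pred, average="macro")

# Micro: pool every decision first, then compute a single recall
micro = recall_score(y_true, y_pred, average="micro")

print(macro, micro)  # 0.75 vs 0.9
```

The macro average is dragged down by the rare class, while the micro average is dominated by the majority class.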

Notes

  • The support is the number of occurrences of each class in y_true.
  • Notice how the previous metrics were based on the "yes" class. This is because in binary classification problems, the default positive label is the target (class 1). You can change this if you are more interested in another class's performance or the average metrics.

              No       Yes     accuracy   macro avg   weighted avg
precision     0.97     0.70    0.97       0.83        0.96
recall        1.00     0.20    0.97       0.60        0.97
f1-score      0.98     0.31    0.97       0.65        0.96
support       865.00   35.00   0.97       900.00      900.00

Potential Reasons for Poor Performance (i.e. Sensitivity)¶

  1. Training vs. test error: "First of all, training error rates will usually be lower than test error rates, which are the real quantity of interest. In other words, we might expect this classifier to perform worse if we use it to predict whether or not a new set of individuals will default. The reason is that we specifically adjust the parameters of our model to do well on the training data. The higher the ratio of parameters p to number of samples n, the more we expect this overfitting to play a role. For these data we don't expect this to be a problem, since p = 2 and n = 10,000."5
  2. Overall Accuracy: During hyperparameter cross-validation we are choosing the model with the best overall accuracy. This gives us a model with the smallest possible total number of misclassified observations, irrespective of which class the errors come from5.
  3. Decision Function: The decision function of a linear SVM may favour the majority class6.

Potential Solutions for Poor Performance (i.e. Sensitivity)¶

There are a number of methods available to address imbalances in a dataset, such as:

  1. Stratify the k-fold,
  2. Weighting the classes in the model during training,
  3. Changing the training metric,
  4. Resampling the data.

Stratified k-fold¶

Some of the folds may not contain the same proportion of each class, so the validation error we get from models may be a poor estimate of performance.

KFold
     Fold 0   Fold 1   Fold 2   Fold 3   Fold 4
0    1565     1572     1577     1570     1560
1    55       48       43       50       60

StratifiedKFold
     Fold 0   Fold 1   Fold 2   Fold 3   Fold 4
0    1568     1569     1569     1569     1569
1    52       51       51       51       51
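Counts like those above can be reproduced with sklearn's splitters (a sketch on synthetic labels with roughly the Default class ratio):

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic labels with ~3.3% positives, mimicking the Default imbalance
rng = np.random.default_rng(0)
y = rng.choice([0, 1], size=8100, p=[0.967, 0.033])
X = np.zeros((len(y), 1))  # features are irrelevant to the split itself

for splitter in (KFold(n_splits=5, shuffle=True, random_state=0),
                 StratifiedKFold(n_splits=5, shuffle=True, random_state=0)):
    counts = [np.bincount(y[val], minlength=2) for _, val in splitter.split(X, y)]
    print(type(splitter).__name__, [c[1] for c in counts])  # class-1 count per fold
```

StratifiedKFold guarantees the per-fold class counts differ by at most one, whereas plain KFold lets them drift.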

Weights¶

We can balance the weights. In sklearn this is done using the following formulation:

$w_j = \frac{n}{k \, n_j}$

where $n$ is the total number of samples, $k$ is the number of classes, and $n_j$ is the number of samples in class $j$ (i.e. n_samples / (n_classes * np.bincount(y))). You can also assign weights manually as you like.
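A sketch of that formulation via sklearn's compute_class_weight, plus the two ways of passing weights to a model:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0, 0, 0, 1])  # three "No", one "Yes"

# 'balanced': n_samples / (n_classes * np.bincount(y))
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
print(weights)  # ≈ [0.667, 2.0]: the minority class gets the larger weight

# Let sklearn balance automatically, or assign weights as you like
auto_clf = LinearSVC(class_weight="balanced")
manual_clf = LinearSVC(class_weight={0: 1.0, 1: 10.0})
```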

Changing Training/Validation Metric¶

Changing the metric for what is defined as the "best model" can help us prioritise models that make particular errors.

For example, a credit card company might particularly wish to avoid incorrectly classifying an individual who will default, whereas incorrectly classifying an individual who will not default, though still to be avoided, is less problematic. In this case, recall would therefore be a useful metric to use.

Notes

  • Now we see that balanced models are indeed better if we want a good average recall or F1.

TODO

  • I don't need to be doing the hyper-search twice just to fit one model; I could do it once, find the best model according to my metric, and then train a model with those parameters. However, I'm lazy and I've made this helper function, so it's less work to just retrain with a different refit.

Resampling¶

Under-Sampling¶

A fast way to balance the data is to randomly select a subset of each class, so that every class has the number of datapoints found in the smallest class.
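The core idea can be sketched in plain NumPy (imblearn's RandomUnderSampler, used in this workbook, does the same job with more options):

```python
import numpy as np

def random_undersample(X, y, seed=None):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
X_res, y_res = random_undersample(X, y, seed=0)
print(np.unique(y_res, return_counts=True))  # both classes end up with 2 samples
```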

Notes

  • RandomUnderSampler is part of the Imblearn package, which provides many techniques for working with imbalanced data.
  • There is a resample method in scikit-learn, but Imblearn is a bit smoother to work with.

Note

  • Make sure your sampler is in the pipeline, otherwise you'll be training and testing your data on a smaller/larger sample than normal and get unrepresentative results!
    • this is an easy mistake to make... I did it when setting this up :')

Oversampling¶

Data can be oversampled easily by randomly sampling from minority classes with replacement to duplicate original samples.
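The same idea in reverse, again as a NumPy sketch (imblearn's RandomOverSampler is the packaged version); each class keeps its original rows and the minority is topped up with duplicates:

```python
import numpy as np

def random_oversample(X, y, seed=None):
    """Keep all rows, then duplicate minority rows (with replacement)
    until every class matches the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    keep = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        extra = rng.choice(idx, size=n_max - len(idx), replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
X_res, y_res = random_oversample(X, y, seed=0)
print(np.unique(y_res, return_counts=True))  # both classes end up with 8 samples
```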

Notes

  • make sure to oversample after splitting the training and validation sets, or you may "bleed" information into the validation sets when testing a model7... In other words, make sure it is in a pipeline!

Best Approach?¶

There is no one best approach; it typically depends on the data and the aims for the model.

Below are examples of cross-validation scores for the best models (according to F1-score) for the different approaches.

Notes

  • If over- and under-sampling are not much different, which in a lot of cases they won't be, you would typically favor undersampling for the savings you get in computational cost.

Using figure [the one above the one above], for the client who wants the model to prioritise avoiding incorrectly classifying an individual who will default, we would probably choose the undersampled linear SVM.

As we can see on the test set, we get similar scores as we did on the validation.

Multi-Class ¶

TODO

  • add in the limitations of multi-class SVM from ML a probabilistic perspective (pg. 509)

Some models (e.g. tree-based classifiers) are inherently multiclass, whereas other machine learning algorithms are able to be extended to multi-class classification using techniques such as the One-versus-Rest or One-versus-One methods3.

One-versus-all (or One-vs-the-rest)¶

The One-versus-all approach is where you train a classifier for each class and select the class whose classifier outputs the highest score3.

In other terms, if we fit $K$ SVMs, we assign a test observation, $x^*$, to the class for which $\beta_{0k} + \beta_{1k}x^*_1 + \cdots + \beta_{pk}x^*_p$ is largest (the most confident)5.
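A sketch of that decision rule with sklearn (OneVsRestClassifier exposes one decision_function score per class, and the argmax picks the most confident):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# K = 3 binary SVMs, each fit class-k-versus-the-rest
ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)

scores = ovr.decision_function(X)  # shape (n_samples, 3): one score per class
pred = scores.argmax(axis=1)       # the most confident classifier wins
print((pred == ovr.predict(X)).all())
```

Taking the argmax by hand reproduces ovr.predict, which applies the same rule internally.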

Advantage: As each class is fitted against all other classes for each classifier, it is relatively interpretable14.

Disadvantages: Can result in ambiguous decision regions (e.g. could be class 1 or class 2), and classifiers could suffer from issues of class imbalance4.

Notes

  • for ease of visualisation we'll only use two features.
  • although now inbuilt into sklearn.svm.SVC, you could also put the SVC inside sklearn.multiclass.OneVsRestClassifier

OneVsOneClassifier¶

Another strategy is to use a OneVsOne approach.

This trains $N \times (N-1) / 2$ classifiers by comparing each pair of classes.

When a prediction is made, the class that is selected the most is chosen (Majority Vote)3.
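A quick sketch confirming the count with sklearn's OneVsOneClassifier (iris has $N = 3$ classes, so $3 \times 2 / 2 = 3$ pairwise classifiers):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

n = len(ovo.classes_)
print(len(ovo.estimators_), n * (n - 1) // 2)  # one fitted SVC per class pair
```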

Advantage: It is useful where algorithms do not scale well with data size (such as SVM), because each classifier only needs to be trained and run on a small subset of the data3,14.

Disadvantages: Can still result in ambiguous decision regions and be computationally expensive4.

Notes

  • Next week we'll talk more about similar ensemble approaches to classification.

Strengths and Limitations ¶

TODO

  • maybe think about the scenario where we compare classifiers on artificial datasets geared towards the strengths and weaknesses of each method (small data bad for forests, good for SVM). I could maybe just repeat this at the end of each notebook focusing on the method of the week (saying not to worry about the other methods just yet) and expanding as the weeks go on. An example of this kind of approach is pg. 153 of intro to stats learning.

Advantages¶

  • Kernels allow us to construct hyperplanes in high dimensional spaces with tractable computation6

  • Provided hyper-parameters are carefully selected, SVMs are less affected than other ML methods by over-parameterization or over-fitting during training (Gonzalez-Vellon et al., 2003; Shoeb et al., 2004).

  • SVMs always converge on the same answer given identical data and hyper-parameters.

Disadvantages¶

"Perhaps the biggest limitation of the support vector approach lies in choice of the kernel. Once the kernel is fixed, SVM classifiers have only one user-chosen parameter (the error penalty), but the kernel is a very big rug under which to sweep parameters. Some work has been done on limiting kernels using prior knowledge (Scholkopf et al., 1998a; Burges, 1998), but the best choice of kernel for a given problem is still a research issue." Burges (1998)

  • Non-linear SVMs are slow to train on very large datasets, although the use of batch algorithms has improved this somewhat11

  • Discrete data presents a problem for the SVM method we have discussed; however, there are alternative implementations in the literature10

  • They are comparatively sensitive to hyperparameters; there can be quite a variation in score depending on setup.

However, we have only been looking at the best performing models; looking at all models during training shows how sensitive models are to changes in the hyperparameters. Below we can see that SVMs in general are pretty sensitive to their parameter settings.

TODO

How robust is the SVM to changes in the random seed?

References¶

  1. https://scikit-learn.org/stable/datasets/toy_dataset.html
  2. Warwick J Nash, Tracy L Sellers, Simon R Talbot, Andrew J Cawthorn and Wes B Ford (1994) "The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait", Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288)
  3. Géron, A. (2017). Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems. " O'Reilly Media, Inc.".
  4. Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT press.
  5. James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Vol. 112. New York: springer, 2013.
  6. http://contrib.scikit-learn.org/imbalanced-learn/stable/introduction.html
  7. https://beckernick.github.io/oversampling-modeling/
  8. Mani, I., & Zhang, I. (2003, August). kNN approach to unbalanced data distributions: a case study involving information extraction. In Proceedings of workshop on learning from imbalanced datasets (Vol. 126).
  9. Raschka, Sebastian, and Vahid Mirjalili. Python Machine Learning, 2nd Ed. Packt Publishing, 2017.
  10. Lemaître, G., Nogueira, F., & Aridas, C. K. (2017). Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. The Journal of Machine Learning Research, 18(1), 559-563.
  11. Wilson, D. L. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, (3), 408-421.
  12. Tomek, I. (1976). Two modifications of CNN. IEEE Trans. Systems, Man and Cybernetics, 6, 769-772.
  13. Batista, G. E., Prati, R. C., & Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD explorations newsletter, 6(1), 20-29.
  14. https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html